Part I - Ford GoBike Data:

Demographics of Bikers Who Use Ford Bikes

by Sofiyah Olaiwon

Introduction

This document explores a dataset which includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area during the month of February.

Preliminary Wrangling

What is the structure of the dataset?

The dataset contains approximately 180,000 rows and 16 columns documenting individual bike rentals in the Ford GoBike system during the month of February. The columns are encoded in numeric, object and categorical data types.
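The structure described above can be checked directly on the loaded table. The sketch below uses a tiny hand-made DataFrame as a stand-in for the real file; the column names are assumptions based on the feature names used in this report, and the values are purely illustrative.

```python
import pandas as pd

# Tiny stand-in for the real trip table (assumed column names, made-up values).
trips = pd.DataFrame({
    "duration_sec": [520, 1800, 305],
    "user_type": ["Subscriber", "Customer", "Subscriber"],
    "member_birth_year": [1984.0, 1990.0, None],
    "member_gender": ["Male", "Female", None],
})

# Store a low-cardinality text column as a categorical.
trips["user_type"] = trips["user_type"].astype("category")

n_rows, n_cols = trips.shape                         # (rows, columns)
dtype_names = trips.dtypes.astype(str).to_dict()     # column -> dtype name
missing_per_column = trips.isna().sum().to_dict()    # column -> number of missing values

print(n_rows, n_cols)
print(dtype_names)
print(missing_per_column)
```

Counting missing values per column at this stage also shows which features will need attention during cleaning.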

What is the main feature of interest in the dataset?

I am most interested in the various characteristics of people who use bikes from the Ford GoBike System.

What features in the dataset will help support investigation into the features of interest?

For my analysis I will use the trip duration in seconds, the longitude and latitude of both the start and end stations, the user type, the member birth year, the member gender and the bike share for all trips feature.

Data Cleaning

Checking for data issues and cleaning accordingly.

Data Issues

Now that the data has been cleaned, let's start exploring.

Univariate Exploration

Question

What is the distribution of riders based on gender?

Visualization
Observation

A large percentage of bikers are male: males make up about 75% of the member population, females about 23% and other genders about 2%.
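Percentages like these can be computed as relative frequencies of the gender column. This is a minimal sketch on made-up labels, so the numbers it prints are not the ones reported above; the column name `member_gender` is an assumption.

```python
import pandas as pd

# Illustrative labels only; the real proportions come from the dataset.
member_gender = pd.Series(
    ["Male"] * 15 + ["Female"] * 4 + ["Other"] * 1, name="member_gender"
)

# Share of each category among rows with a recorded gender, in percent.
gender_share = member_gender.value_counts(normalize=True).mul(100).round(1)
print(gender_share)
```

Note that `value_counts` skips missing values by default, so the shares refer only to riders whose gender is recorded.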

Question

Which user type rides bikes more?

Visualization
Observation

Subscribers make use of the bikes more than regular customers.

Question

How likely are bikers to share a GoBike during a trip?

Visualization
Observation

Most bike users prefer not to share bikes during a trip.

Question

What is the age range of Ford GoBike members?

Visualization
Observation

While riders' ages range from 20 to about 80, a large percentage of bikers fall into the 25 to 45 age range.
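An age range like this can be read off by deriving an age from the birth year and counting riders per age band. The sketch below is only an illustration: the reference year, the cut-off of 80 and the band edges are assumptions, and the birth years are made up.

```python
import pandas as pd

REFERENCE_YEAR = 2019  # assumption: the year in which the trips were recorded

birth_year = pd.Series([1999, 1990, 1984, 1975, 1950, 1900], name="member_birth_year")
member_age = REFERENCE_YEAR - birth_year

# Drop implausible ages (the cut-off of 80 is an illustrative choice).
plausible_age = member_age[member_age <= 80]

# Count riders per age band; bins are right-closed, e.g. (25, 45].
age_bands = pd.cut(plausible_age, bins=[0, 25, 45, 80])
band_counts = age_bands.value_counts().sort_index()
print(band_counts)
```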

Question

What period of time do most bikers spend cycling?

Visualization
Observation

Although most riders prefer to take short trips on the bikes, some bikers spend hours on them. This will be investigated in further exploratory analysis.

Question

Are some bike stations more frequented than others?

Visualization
Observation

The distribution plots of the start and end station ids look similar, and both show that some stations are more frequented than others.
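Besides reading it off a histogram, station usage can be ranked directly by counting trips per station id. This is a small sketch on made-up ids; the column name `start_station_id` is an assumption.

```python
import pandas as pd

# Made-up station ids, one entry per trip start.
start_station_id = pd.Series([3, 3, 3, 15, 15, 58, 81], name="start_station_id")

# Number of trips starting at each station, most frequented first.
trips_per_station = start_station_id.value_counts()
busiest = trips_per_station.head(2)
print(busiest)
```

The same call on the end station ids gives a ranking that can be compared with this one.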

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

From the analysis done above, there is a gender imbalance in the data, with about 75% male riders. A large percentage of bike riders are subscribers, and most riders prefer not to share rides during a trip.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The member age and duration features had huge outlier values that could affect the analysis, so these were dropped during the course of the analysis. The histograms of the start and end station ids were similar in shape, and both show that some bike stations are more frequented or accessible than others. I also applied a log transformation to the x axis of the duration histogram, under which the ride durations look approximately normally distributed.
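The two operations mentioned here, dropping extreme durations and binning on a log scale, can be sketched as follows. The one-hour threshold and the bin edges are illustrative assumptions, not the values used in this analysis, and the durations are made up.

```python
import numpy as np
import pandas as pd

duration_sec = pd.Series([60, 100, 1000, 1000, 10000, 86000], name="duration_sec")

# Outlier filter: keep rides of at most one hour (illustrative threshold).
kept = duration_sec[duration_sec <= 3600]

# Bin edges that are evenly spaced on a log10 axis: 10^1 ... 10^5.
edges = [10, 100, 1_000, 10_000, 100_000]
counts, _ = np.histogram(duration_sec, bins=edges)

print(len(kept), counts)
```

With edges spaced by powers of ten, each bar covers the same width on a log-scaled x axis, which is what makes a long right tail readable.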

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

Question

How is the duration of time spent on Ford GoBikes distributed across the month of February?

Visualization
Observation

The time series plot above shows a periodic pattern with high spikes: at specific times of the day bikers take very long rides, and at other times there is little or no use of the bikes.
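One way to look at this daily pattern more closely is to group trips by the hour at which they start. The sketch below assumes a `start_time` column holding timestamps, which is not named in this report; the rows are made up.

```python
import pandas as pd

trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-01 08:05:00", "2019-02-01 08:40:00",
        "2019-02-01 13:15:00", "2019-02-02 08:20:00",
    ]),
    "duration_sec": [600, 400, 1800, 500],
})

# Number of trips and mean duration for each hour of the day.
hourly = trips.groupby(trips["start_time"].dt.hour)["duration_sec"].agg(["count", "mean"])
hourly.index.name = "start_hour"
print(hourly)
```

Keeping the trip count next to the mean helps separate hours with many short rides from hours with a few very long ones.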

Question

Which categories of customers take the longest rides?

Visualization
Observation

The plot above shows that, on average, females and other genders take slightly longer rides than males, although males make up a higher percentage of riders. The same holds for user types: customers have a higher average duration on bikes than subscribers, who make up more of the data. This could be due to the imbalance between the categories.
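Because the groups differ so much in size, it helps to report the group size next to each average. A minimal sketch on made-up rows, with assumed column names:

```python
import pandas as pd

trips = pd.DataFrame({
    "member_gender": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "duration_sec": [300, 400, 500, 600, 500, 700],
})

# Group size, mean and median duration per gender.
summary = trips.groupby("member_gender")["duration_sec"].agg(["count", "mean", "median"])
print(summary)
```

The median is included because it is less sensitive than the mean to a handful of very long rides.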

Let's look at a PairGrid and a heatmap of the numerical values in the dataset to look for correlation between variables!

There seems to be little or no correlation between most of the columns, apart from member birth year and member age, which makes a lot of sense since one is derived from the other. There is also a correlation between start station id and end station id.
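The strong link between birth year and age is expected if age is computed as a fixed year minus the birth year: the two columns are then perfectly negatively correlated. The sketch below illustrates this under that assumption, with a hypothetical reference year and made-up rows.

```python
import pandas as pd

REFERENCE_YEAR = 2019  # assumption: the year in which the trips were recorded

trips = pd.DataFrame({
    "duration_sec": [300, 900, 450, 1200, 600],
    "member_birth_year": [1990, 1985, 1995, 1970, 1988],
})
trips["member_age"] = REFERENCE_YEAR - trips["member_birth_year"]

# Pearson correlation matrix of the numeric columns (the input to a heatmap).
corr = trips.corr()
print(corr.round(2))
```

Since one of these two columns carries no information beyond the other, only one of them needs to be kept for further analysis.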

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The boxplot above shows that, on average, females and other genders take slightly longer rides than males, although males make up a higher percentage of riders. The same holds for user types: customers have a higher average duration on bikes than subscribers, who make up more of the population. This could be due to the imbalance between the categories.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In the time series plot above, the duration shows a periodic pattern over time: at specific times of the day bikers take very long rides, and at other times there is little or no use of the bikes.

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

Question

What is the average time spent on bikes by each gender type and does it vary based on whether they are subscribers or customers?

Visualization
Observation

Based on the user type, subscribers spend less time on bikes than customers. For subscribers, other genders spend the most time on bikes, followed by females and then males. For customers, females spend the most time on bikes.
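A comparison across two categorical features like this one boils down to a table of group means. This is a small sketch on made-up rows, with assumed column names, of how such a table could be built:

```python
import pandas as pd

trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber",
                  "Customer", "Customer", "Customer"],
    "member_gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "duration_sec": [400, 600, 500, 900, 1200, 1000],
})

# Mean duration for every (user type, gender) combination.
mean_duration = trips.pivot_table(
    index="user_type", columns="member_gender",
    values="duration_sec", aggfunc="mean",
)
print(mean_duration)
```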

Question

What is the average time spent on bikes by each user type and does it vary based on whether they share bikes for all trips?

Visualization
Observation

Interestingly, no customers shared their bikes while on a trip. This may be because the bike sharing feature is not available to non-subscribers, but this is just speculation as we don't have any information on it.
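A finding like this can be confirmed with a simple contingency table of user type against the sharing flag. The sketch below uses made-up rows; the column name `bike_share_for_all_trip` and its "Yes"/"No" values are assumptions.

```python
import pandas as pd

trips = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Customer"],
    "bike_share_for_all_trip": ["Yes", "No", "No", "No", "No"],
})

# Trip counts for every combination of user type and sharing flag.
share_table = pd.crosstab(trips["user_type"], trips["bike_share_for_all_trip"])
customers_sharing = share_table.loc["Customer", "Yes"]
print(share_table)
```

A zero in the Customer/Yes cell is what the observation above corresponds to.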

Question

Let's use the start and end station latitudes, longitudes and ids to create a scatter mapbox plot that shows the various areas where bikers start and end their rides.

Visualization
Observation

The color distribution of station ids on the scatter mapbox plot shows a pattern that points to a relationship between location and station id. To put it better, certain locations are assigned a range of station ids.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

The color distribution of station ids on the scatter mapbox plot shows a pattern that points to a relationship between location and station id. To put it better, certain locations are assigned a range of station ids.
Interestingly, no customers shared their bikes while on a trip. This may be because the bike sharing feature is not available to non-subscribers, but this is just speculation as we don't have any information on it.

Conclusions

From this analysis, we can observe the following about the GoBike users.